!pip install python-dotenv
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
ORPO is a fine-tuning method that merges traditional supervised fine-tuning and preference alignment into a single process. This reduces the computational resources and time needed for training. In addition, the empirical results reported in the ORPO paper show it outperforming other alignment methods across different model sizes and benchmarks.
In this article, we will fine-tune the newest Mistral 7B model using ORPO and the TRL library. The code is accessible on Google Colab and in the LLM Tutorial on GitHub.
ORPO
Instruction tuning and preference alignment are crucial methods for customizing Large Language Models (LLMs) for particular tasks. Typically, this entails a multi-step process: first, Supervised Fine-Tuning (SFT) on instructions to tailor the model to the desired domain, and second, applying preference alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO) to increase the probability of producing preferred responses over less desirable ones.
Researchers have discovered a drawback in this method. Although SFT successfully adjusts the model to the target domain, it also unintentionally raises the chances of producing both unwanted and desired answers. Therefore, the preference alignment stage is essential to enlarge the disparity between the probabilities of accepted and rejected outputs.
Hong and Lee (2024) presented ORPO, an innovative approach that combines instruction tuning and preference alignment into a single training framework. ORPO modifies the conventional language modeling objective by incorporating the negative log-likelihood loss with an odds ratio (OR) component. This OR loss applies a mild penalty to rejected responses while greatly rewarding preferred ones, allowing the model to simultaneously learn the target task and align with human preferences.
\mathscr{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)}\left[\mathscr{L}_{SFT} + \lambda \cdot \mathscr{L}_{OR}\right]
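To make the objective concrete, here is a minimal numerical sketch of its two terms. It assumes the odds-ratio term as defined in the ORPO paper, L_OR = -log σ(log(odds(y_w|x) / odds(y_l|x))) with odds(y|x) = P(y|x) / (1 - P(y|x)), and it takes the (length-normalized) sequence log-probabilities as plain floats. The function names and the default value of λ are illustrative assumptions, not TRL's implementation.

```python
import math

def log_odds(logp):
    """Return log(p / (1 - p)) given log p, for p strictly between 0 and 1."""
    return logp - math.log1p(-math.exp(logp))

def odds_ratio_loss(logp_chosen, logp_rejected):
    """L_OR = -log(sigmoid(log odds(chosen) - log odds(rejected)))."""
    log_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log(sigmoid(z)) is the same as log(1 + exp(-z))
    return math.log1p(math.exp(-log_ratio))

def orpo_loss(nll_chosen, logp_chosen, logp_rejected, lam=0.1):
    """SFT term (negative log-likelihood of the chosen answer) plus lambda * L_OR."""
    return nll_chosen + lam * odds_ratio_loss(logp_chosen, logp_rejected)

# The penalty shrinks as the chosen answer becomes more likely than the rejected one.
equal = odds_ratio_loss(math.log(0.5), math.log(0.5))
separated = odds_ratio_loss(math.log(0.8), math.log(0.2))
print(f"equal odds: {equal:.4f}, chosen favoured: {separated:.4f}")
```

In TRL, the weight λ is exposed as the beta argument of ORPOConfig.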
ORPO has been integrated into key fine-tuning libraries such as TRL, Axolotl, and LLaMA-Factory. The following section will demonstrate its usage with TRL.
Fine-Tuning Mistral v0.3 with ORPO and Unsloth
Mistral AI’s v0.3 is a significant update to their AI model, introducing improved performance and efficiency. This version includes enhanced instruction-following capabilities, making interactions more intuitive. Additionally, Mistral v0.3 incorporates advanced reasoning skills, enabling it to tackle complex tasks more effectively. The update also extends the context length to 32768 tokens, allowing for more detailed and coherent conversations. Technical details include an extended vocabulary (32000 to 32768), a new tokenizer, and support for function calling.
ORPO necessitates a preference dataset that includes a prompt, a selected answer, and a discarded answer. To achieve this, we will utilize llmat/dpo-orpo-mix-38k-balanced, a dataset that merges high-quality DPO datasets and has been further balanced using a clustering-based approach.
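As a rough illustration of the expected layout, the sketch below builds one made-up preference record in which chosen and rejected are full conversations that share the same prompt turns and differ only in the final assistant reply. This is the structure the formatting code later in this article relies on. The record contents and the helper name split_preference_record are hypothetical.

```python
# A made-up record: both conversations share the prompt, only the last turn differs.
record = {
    "chosen": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 equals 4."},
    ],
    "rejected": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "It is 5."},
    ],
}

def split_preference_record(example):
    """Split a record into (prompt turns, chosen reply, rejected reply)."""
    prompt = example["chosen"][:-1]      # everything except the final turn
    chosen = example["chosen"][-1:]      # final turn of the preferred conversation
    rejected = example["rejected"][-1:]  # final turn of the dispreferred conversation
    return prompt, chosen, rejected

prompt, chosen, rejected = split_preference_record(record)
print("Prompt:", prompt)
print("Chosen:", chosen[0]["content"])
print("Rejected:", rejected[0]["content"])
```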
To fine-tune our model efficiently, we will use the Unsloth library. Unsloth significantly improves speed and memory efficiency when training Large Language Models (LLMs). These gains come from several optimizations, including manual autograd and chained matrix multiplication. Furthermore, it utilizes Flash Attention via xformers and Tri Dao’s implementation, a highly optimized approach to computing attention in transformer models. Unsloth is reported to make fine-tuning 2 times faster with 50% less memory usage.
Let’s start by installing the required libraries (the pip commands are listed at the top of this article).
Now let’s log in to our W&B workspace:
import wandb
import os
import dotenv
dotenv.load_dotenv()
%env WANDB_NOTEBOOK_NAME = $Fine_tune_Mistral_with_ORPO
wandb.login(key=os.environ["WANDB_API_KEY"])
Load the Model and Tokenizer for LoRA
In the following, we will load the Mistral 7B v0.3 model in 4-bit precision using bitsandbytes.
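A back-of-the-envelope estimate shows why 4-bit loading matters on a consumer GPU. The sketch assumes roughly 7.25 billion parameters for Mistral 7B (an approximation) and treats every weight as stored at the given precision, ignoring quantization constants, layers that stay in 16-bit, activations, and optimizer state. The helper name is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Raw weight storage in GiB for a given parameter count and precision."""
    return num_params * bits_per_param / 8 / 1024**3

approx_params = 7.25e9  # assumed, approximate parameter count for Mistral 7B

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(approx_params, bits):.1f} GiB")
```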
cache_dir = './model'
model_id = 'mistralai/Mistral-7B-v0.3'
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_id,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
Loading Checks
After loading the model, it’s crucial to ensure that all parameters are correctly placed on the GPU and that none are overflowing onto the CPU. This can be particularly important for large models where memory management is critical.
To verify this, you can iterate through the model’s named parameters and check their device type. Any parameter that was not placed on the GPU shows up with the device type ‘meta’ (a placeholder that holds no actual data, used for weights offloaded to CPU or disk) and will be printed out.
Here is the code to perform this check:
# Check there are no parameters overflowing onto cpu (meta).
for n, p in model.named_parameters():
    if p.device.type=='meta':
        print(f"{n} is on meta!")
Prepare for LoRA fine-tuning
Before starting the LoRA (Low-Rank Adaptation) fine-tuning process, it’s essential to understand which parameters in your model are trainable and which are not. This helps in ensuring that only the desired parameters are updated during training, which is crucial for efficient and effective fine-tuning.
To achieve this, you can use the following function to print the number of trainable parameters in the model and list which parameters are trainable and which are not.
Here is the code to perform this check:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model and lists which parameters
    are trainable and which are not.
    """
    trainable_params = 0
    non_trainable_params = 0
    all_params = 0
    print("Trainable Parameters:")
    for name, param in model.named_parameters():
        all_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
            print(f" {name}")
        else:
            non_trainable_params += param.numel()
    print("\nNon-Trainable Parameters:")
    for name, param in model.named_parameters():
        if not param.requires_grad:
            print(f" {name}")
    print(
        f"\nSummary:\n Trainable params: {trainable_params}\n Non-Trainable params: {non_trainable_params}\n All Parameters: {all_params}")
Let’s take a look at the model:
print(model)
Setting Up LoRA Fine-Tuning
To prepare your model for LoRA (Low-Rank Adaptation) fine-tuning, you need to configure it properly. This involves setting up the LoRA configuration. Here’s a brief overview of the parameters and their best settings:
r: This parameter controls the rank of the low-rank adaptation matrices. It’s suggested to choose a value greater than 0, with common choices being 8, 16, 32, 64, or 128. The best setting depends on the specific use case and computational resources, but a good starting point is 8 or 16.
lora_alpha: This parameter scales the magnitude of the LoRA update. A higher value can lead to more significant changes in the model’s behavior. The best setting is typically 32, as used in the code.
target_modules: This list specifies which modules in the model should be fine-tuned. The best settings include key modules like "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", and "down_proj". If the task involves chat fine-tuning, it’s also beneficial to set "lm_head" (language model head) as trainable.
use_gradient_checkpointing: This parameter activates gradient checkpointing to conserve memory. It is managed by Unsloth, which offloads input and output embeddings to disk, thereby saving VRAM.
random_state: This parameter sets the seed for random number generation, ensuring reproducibility. The best setting is any integer value; in the code, it’s set to 3407.
use_rslora: This parameter activates RSLoRA, which adjusts the scaling factor of LoRA adapters to be proportional to 1/√r instead of 1/r. This adjustment enhances the stability of learning, particularly for higher adapter ranks, and improves fine-tuning performance as the rank increases.
These settings provide a good starting point for fine-tuning a language model using PEFT. However, the optimal settings may vary depending on the specific task and dataset, so some experimentation may be necessary.
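To make the effect of r, lora_alpha, and use_rslora more tangible, the sketch below computes the adapter scaling factor under both conventions (lora_alpha / r for standard LoRA, lora_alpha / √r for RSLoRA) and the number of parameters one adapter adds to a single linear layer. The helper names are hypothetical, and the 4096 x 4096 layer shape is an assumption standing in for an attention projection in a 7B model.

```python
import math

def lora_scaling(lora_alpha, r, use_rslora=False):
    """Scaling applied to the low-rank update B @ A."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

def lora_param_count(d_in, d_out, r):
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

for r in (8, 16, 64, 128):
    standard = lora_scaling(32, r)
    rslora = lora_scaling(32, r, use_rslora=True)
    print(f"r={r:>3}: standard={standard:.3f}, rslora={rslora:.3f}")

# Assumed 4096 x 4096 projection with r = 8
print("Added parameters:", lora_param_count(4096, 4096, 8))
```

With standard scaling the factor shrinks quickly as the rank grows, whereas the RSLoRA factor decays more slowly, which is the motivation given above for using it at higher ranks.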
model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    lora_alpha = 32,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head", # Language model head - best to set this trainable if chat fine-tuning
    ],
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = True,
)
Set up Tokenizer and Padding
Before starting the fine-tuning process, it’s essential to configure the tokenizer and set up padding correctly. This ensures that the model can handle input sequences efficiently and that special tokens are properly managed.
Here is a step-by-step guide to setting up the tokenizer and padding:
Inspect the Tokenizer: Print out the tokenizer details, including the vocabulary size, beginning-of-sequence (BOS) token, end-of-sequence (EOS) token, and chat template.
Optionally Set the Chat Template Manually: If needed, you can manually set the chat template. This is useful for ensuring that the conversation starts correctly depending on the initial message role.
Apply the Chat Template: Use the chat template to format a list of messages.
Set the Pad Token: Determine the appropriate pad token based on the tokenizer’s vocabulary and set it accordingly.
Update the Model Configuration: Ensure that the model and its configuration are updated with the correct pad token ID.
Here is the code to perform these steps:
print(tokenizer)
print(tokenizer.vocab_size)
print(tokenizer.bos_token)
print(tokenizer.eos_token)
print(tokenizer.chat_template)
Next, we create a custom chat template designed for Llama/Mistral models. This template ensures that conversations start correctly by conditionally adding a beginning-of-sequence token (bos_token) only if the first message is not from the assistant. This is particularly useful when formatting chosen and rejected responses separately, as it avoids adding an extra bos_token before the response.
The template is defined using a Jinja-like syntax, which iterates through the messages and formats them based on their roles (user or assistant). For user messages, it wraps the content with [INST] and [/INST] tags, while for assistant messages, it appends an end-of-sequence token (eos_token).
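The branching just described can be mirrored in plain Python. This is only an illustration of the logic, not how the tokenizer renders templates (it uses Jinja for that), and the function name render_chat as well as the default token strings `<s>` and `</s>` are assumptions.

```python
def render_chat(messages, bos_token="<s>", eos_token="</s>"):
    """Plain-Python mirror of the chat template's branching logic."""
    parts = []
    # Only prepend BOS when the conversation does not start with the assistant,
    # so a lone response is formatted without an extra BOS token.
    if messages[0]["role"] != "assistant":
        parts.append(bos_token)
    for message in messages:
        if message["role"] == "user":
            parts.append("[INST] " + message["content"] + " [/INST]")
        elif message["role"] == "assistant":
            parts.append(message["content"] + eos_token)
    return "".join(parts)

conversation = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
print(render_chat(conversation))       # <s>[INST] Hi [/INST]Hello!</s>
print(render_chat(conversation[-1:]))  # Hello!</s>
```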
tokenizer.chat_template = """{% if messages[0]['role'] != 'assistant' %}{{ bos_token }}{% endif %}{% for message in messages %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token }}{% endif %}{% endfor %}
"""
# Test chat template
messages = [
{'role': 'user', 'content': 'write a quick sort algorithm in python.'},
{'role': 'assistant', 'content': 'here you are.'},
{'role': 'user', 'content': 'great.'},
]
inputs = tokenizer.apply_chat_template(messages, tokenize=False)
print(inputs)
## set the pad token to <pad>, if not <|pad|>, if not <unk>, otherwise fall back to the EOS token
if '<pad>' in tokenizer.get_vocab():
    print('<pad> token is in the tokenizer. Using <pad> for pad')
    # Set the pad token
    tokenizer.pad_token = '<pad>'
elif '<|pad|>' in tokenizer.get_vocab():
    print('<|pad|> token is in the tokenizer. Using <|pad|> for pad')
    # Set the pad token
    tokenizer.pad_token = '<|pad|>'
elif '<unk>' in tokenizer.get_vocab():
    print('<unk> token is in the tokenizer. Using <unk> for pad')
    # Set the pad token
    tokenizer.pad_token = '<unk>'
else:
    print(f'Using EOS token, {tokenizer.eos_token}, for padding. Warning, this ')
    tokenizer.pad_token = tokenizer.eos_token
# Update pad token id in model and its config
model.pad_token_id = tokenizer.pad_token_id
model.config.pad_token_id = tokenizer.pad_token_id
# Check if they are equal
assert model.pad_token_id == tokenizer.pad_token_id, "The model's pad token ID does not match the tokenizer's pad token ID"
# Print the pad token ids
print('Tokenizer pad token ID:', tokenizer.pad_token_id)
print('Model pad token ID:', model.pad_token_id)
print('Model config pad token ID:', model.config.pad_token_id)
print('Number of tokens now in tokenizer:', tokenizer.vocab_size)
print('Special tokens map:', tokenizer.special_tokens_map)
print('All special tokens:', tokenizer.all_special_tokens)
print(tokenizer)
Set embed and norm layers to trainable (recommended for chat fine-tuning if chat template has been changed)
When fine-tuning a model for chat applications, it’s often beneficial to set specific layers to be trainable, especially if you are changing the chat template. This ensures that the model can adapt to the new input format more effectively.
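One simple way to pick such layers is substring matching on parameter names. The sketch below applies that idea to a handful of hypothetical parameter names, written in the style of a PEFT-wrapped Llama/Mistral model, to show which ones a given key list would select. Note that the LoRA adapter weight in the example is not matched by these keys.

```python
trainable_keys = ["embed_tokens", "input_layernorm", "post_attention_layernorm", "norm"]

# Hypothetical parameter names, for illustration only
example_names = [
    "base_model.model.model.embed_tokens.weight",
    "base_model.model.model.layers.0.input_layernorm.weight",
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight",
    "base_model.model.model.layers.0.mlp.up_proj.weight",
    "base_model.model.model.norm.weight",
]

def is_selected(name, keys=trainable_keys):
    """True if any key appears as a substring of the parameter name."""
    return any(key in name for key in keys)

selected = [name for name in example_names if is_selected(name)]
for name in example_names:
    print(f"{'trainable' if is_selected(name) else 'frozen':>9}  {name}")
```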
Here is a step-by-step guide to setting specific layers to trainable:
Identify Trainable Parameters: Create a list of the names of the layers you want to set as trainable.
Set Modules to Trainable: Iterate through the model’s parameters and set the requires_grad attribute to True for the specified layers. Optionally, set the rest to False.
Create a Dictionary of Trainable Parameters: Collect the trainable parameters into a dictionary for easy access.
Convert to State Dict Format: Convert the trainable parameters to a state dictionary format, which can be useful for saving and loading the model’s state.
Print Trainable Parameters: Use a function to print the trainable parameters to verify the setup.
Here is the code to perform these steps:
# List to hold the names of the trainable parameters
trainable_params_names = ['embed_tokens', 'input_layernorm', 'post_attention_layernorm', 'norm']
# Set modules to be trainable
for n, p in model.named_parameters():
    if any(k in n for k in trainable_params_names):
        p.requires_grad_(True)
    else:
        # Optional: set the rest to be not trainable.
        # Note: run after get_peft_model, this also freezes the LoRA adapter weights;
        # remove this branch if the adapters should stay trainable.
        p.requires_grad_(False)
# Make a dictionary of trainable parameters
trainable_params = {n: p for n, p in model.named_parameters() if p.requires_grad}
# Convert trainable_params to state_dict format
trainable_params_state_dict = {n: p.data for n, p in trainable_params.items()}
print_trainable_parameters(model)
Loading and Preparing the Dataset for Fine-Tuning
In this section, we will guide you through the process of loading and preparing a dataset for fine-tuning a model. This involves loading the dataset, shuffling it, splitting it into training and test sets, and applying a specific template to format the data correctly.
Here is a step-by-step guide to loading and preparing the dataset:
Import Necessary Libraries: Import the required libraries, including json for handling JSON data and datasets for loading and manipulating the dataset.
Define Dataset Parameters: Set the dataset name and the maximum number of samples to use. If you want to use the full dataset, set max_num_samples to None.
Define the build_dataset Function: Create a function called build_dataset that takes a tokenizer, dataset name, cache directory, maximum number of samples, and other parameters as inputs. This function will load the dataset, shuffle it, split it into training and test sets, and apply a specific template to format the data.
Load the Dataset: Use the load_dataset function from the datasets library to load the dataset. The dataset is split based on the max_num_samples parameter.
Shuffle the Dataset: If max_num_samples is not None, shuffle the dataset to ensure randomness.
Split the Dataset: Determine the number of test samples and split the dataset into training and test sets using the train_test_split method.
Apply the DPO Template: Define a function called apply_dpo_template that formats the data according to the DPO (Direct Preference Optimization) template. This function extracts the necessary information from the dataset and applies the chat template using the tokenizer.
Map the Dataset: Use the map method to apply the apply_dpo_template function to the dataset. Remove the original columns and rename the new columns accordingly.
Return the Dataset: Return the training and test datasets.
Check the Chat Template: Ensure that the chat template is correctly applied and that special tokens are not included when tokenizing the responses.
Here is the code to perform these steps:
# Prepared with the help of code from: https://github.com/xfactlab/orpo/blob/main...
import json
# Load the dataset
dataset_name = 'llmat/dpo-orpo-mix-38k-balanced' # Ensure this is defined
max_num_samples = None # Set to None to use the full dataset
#max_num_samples = 10000 # Uncomment to load only the first 10k samples
from datasets import load_dataset
def build_dataset(tokenizer, data_name, cache_dir=None, max_num_samples=10000, test_size_ratio=0.1):
    # Determine the split specification based on max_num_samples
    split_spec = 'train' if max_num_samples is None else f'train[:{max_num_samples}]'
    # Load the dataset
    full_data = load_dataset(data_name, split=split_spec, cache_dir=cache_dir)
    # Shuffle the dataset when only a subset is loaded
    if max_num_samples is not None:
        full_data = full_data.shuffle(seed=42)
    # Determine the number of test samples
    num_total_samples = len(full_data)
    test_size = int(test_size_ratio * num_total_samples)
    # Randomly split the data into training and test sets
    dataset = full_data.train_test_split(test_size=test_size)
    column_names = list(dataset['train'].features)
    def apply_dpo_template(example):
        # function adapted from https://kaitchup.substrack.com/p/fine-tune-a-better-go
        if all(k in example.keys() for k in ('chosen', 'rejected')):
            # For DPO, the inputs are triples of (prompt, chosen, rejected), where 'chosen'
            # and 'rejected' hold the final turn of the dialogue.
            # We therefore need to extract the N-1 turns to form the prompt
            prompt_messages = example['chosen'][:-1]
            example['messages'] = example['chosen']
            # Now we extract the final turn to define chosen/rejected responses
            chosen_messages = example['chosen'][-1:]
            rejected_messages = example['rejected'][-1:]
            example['text_chosen'] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
            example['text_rejected'] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
            example['text_prompt'] = tokenizer.apply_chat_template(prompt_messages, tokenize=False)
        return example
    dataset = dataset.map(apply_dpo_template, remove_columns=column_names,
                          desc='Formatting comparisons with prompt template',)
    for split in ['train', 'test']:
        dataset[split] = dataset[split].rename_columns(
            {'text_prompt': 'prompt', 'text_chosen': 'chosen', 'text_rejected': 'rejected', 'messages': 'messages'}
        )
    return dataset['train'], dataset['test']
# Assuming 'tokenizer' and 'dataset_name' are already defined
train, test = build_dataset(tokenizer, dataset_name, cache_dir='./dataset', max_num_samples=max_num_samples)
# Check the chat template!!! <s> should not be included when tokenizing the responses
After preparing and formatting your dataset for fine-tuning, it’s crucial to inspect the data to ensure that it has been correctly processed. This step helps you verify that the prompt, chosen, rejected, and messages fields are properly formatted and contain the expected information.
print('Prompt:', train['prompt'][0])
print('\n\nChosen:', train['chosen'][0])
print('\n\nRejected:', train['rejected'][0])
print('\n\nMessages (incl. prompt):', train['messages'][0])
Setting Up and Running Training
In this tutorial, we will go through the process of setting up and running the training for your model. This includes configuring training parameters, creating a custom logging callback, and initiating the training process.
Here is a step-by-step guide to setting up and running the training:
Set Training Parameters: Define the training parameters such as the model name, number of epochs, gradient accumulation steps, batch size, and the directory to save the results.
Create a Custom Logging Callback: Implement a custom callback to log training metrics to a file. This callback will write the training and evaluation loss to a log file and save the trainable parameters at checkpoint steps.
Initialize the Logging Callback: Create an instance of the custom logging callback with the specified log file path.
Here is the code to perform these steps:
model_name = model_id.split('/')[-1]
epochs=1
grad_accum=4
batch_size=8
fine_tune_tag='ORPO'
save_dir = f'./results/{model_name}_{dataset_name}_{epochs}_epochs_{fine_tune_tag}'
print(save_dir)
import transformers
import os
import torch
# Custom callback to log metrics
class LoggingCallback(transformers.TrainerCallback):
    def __init__(self, log_file_path):
        self.log_file_path = log_file_path
    def on_log(self, args, state, control, model=None, logs=None, **kwargs):
        logs = logs or {} # Guard against a missing logs dictionary
        with open(self.log_file_path, 'a') as f:
            if 'loss' in logs:
                f.write(f'Step: {state.global_step}, Training Loss: {logs["loss"]}\n')
            if 'eval_loss' in logs:
                f.write(f'Step: {state.global_step}, Eval Loss: {logs["eval_loss"]}\n')
            f.flush() # Force flush the buffered data to file
        # Check if the current step is a checkpoint step
        if state.global_step % int(args.save_steps) == 0:
            # Check if the last checkpoint path exists
            if state.best_model_checkpoint:
                checkpoint_dir = state.best_model_checkpoint
            else:
                # If not, construct the checkpoint directory path
                checkpoint_dir = os.path.join(args.output_dir, f'checkpoint-{state.global_step}')
            # Ensure the checkpoint directory exists
            os.makedirs(checkpoint_dir, exist_ok=True)
            # Save trainable params in the checkpoint directory
            current_trainable_params = {n: p for n, p in model.named_parameters() if p.requires_grad}
            current_trainable_params_state_dict = {n: p.data for n, p in current_trainable_params.items()}
            file_path = os.path.join(checkpoint_dir, 'trainable_params.pt')
            torch.save(current_trainable_params_state_dict, file_path)
# Log file path
cache_dir = './dataset' # Directory that will hold the training log
os.makedirs(cache_dir, exist_ok=True) # Make sure the directory exists before logging
log_file_path = os.path.join(cache_dir, 'training_logs.txt')
# Create an instance of the custom callback
logging_callback = LoggingCallback(log_file_path)
Setting Up ORPO Training
In this section, we’ll walk through setting up and training a model using the ORPOTrainer from the trl library.
I trained the model on the entire dataset (38k samples) using an RTX 4090 GPU (24 GB of VRAM). The training took 7 hours and 35 minutes. You can use smaller GPUs with less VRAM and a smaller batch size. In this case, I recommend only loading a subset of the dataset to speed up training. You can do it by modifying the previous code block, like ‘max_num_samples = 10000’ to only load 10k samples.
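To get a feel for how batch size, gradient accumulation, and dataset size translate into optimizer steps, here is a small arithmetic sketch. It assumes a single GPU, uses the batch_size = 8 and grad_accum = 4 values from above, and interprets a fractional eval_steps as a share of the total steps. The sample count of 9,000 (a 10k subset minus a 10% test split) and the helper name are illustrative, and the exact rounding inside the trainer may differ between library versions.

```python
import math

def training_schedule(num_train_samples, batch_size, grad_accum, epochs, eval_ratio):
    """Approximate effective batch size, total optimizer steps, and eval interval."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(num_train_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    eval_every = math.ceil(total_steps * eval_ratio)
    return effective_batch, total_steps, eval_every

effective_batch, total_steps, eval_every = training_schedule(
    num_train_samples=9000, batch_size=8, grad_accum=4, epochs=1, eval_ratio=0.2
)
print(f"Effective batch size: {effective_batch}")
print(f"Total optimizer steps: {total_steps}")
print(f"Evaluate roughly every {eval_every} steps")
```

Halving the batch size while doubling gradient accumulation keeps the effective batch size unchanged, which is one way to fit the same run onto a smaller GPU.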
Configure ORPO
Define the configuration for the ORPO training. This configuration includes various hyperparameters and settings for training.
from trl import ORPOTrainer, ORPOConfig
from unsloth import is_bfloat16_supported
orpo_config = ORPOConfig(
    beta=0.2,
    save_steps=500,
    logging_steps=1,
    num_train_epochs=epochs,
    output_dir=save_dir,
    evaluation_strategy='steps',
    do_eval=True,
    eval_steps=0.2,
    per_device_eval_batch_size=batch_size,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=grad_accum,
    log_level='debug',
    optim='paged_adamw_8bit',
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    max_grad_norm=0.3,
    lr_scheduler_type='linear',
    warmup_ratio=0.03,
    learning_rate=1e-4,
    max_prompt_length=512,
    max_length=1024,
    max_completion_length=1024,
    remove_unused_columns=True,
)
Initialize ORPOTrainer
Create an instance of ORPOTrainer with the model, datasets, tokenizer, and the configuration defined earlier.
orpo_trainer = ORPOTrainer(
    model,
    args=orpo_config,
    train_dataset=train,
    eval_dataset=test,
    tokenizer=tokenizer,
    callbacks=[logging_callback], # Add custom callback here
)
Train the Model
Set the model configuration to avoid cache warnings and start the training process.
model.config.use_cache = False # silence the warnings
orpo_trainer.train()
Plotting Training and Evaluation Losses with Matplotlib
After training your model, it’s important to visualize the training and evaluation losses to understand how well your model is performing and to identify any potential issues. Visualizing the losses can help you diagnose problems such as overfitting or underfitting and make informed decisions about further training or model adjustments.
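Because the configuration above logs the training loss at every step (logging_steps=1), the raw curve can be noisy. One optional way to make the trend easier to read is to smooth it with a trailing moving average before plotting. The sketch below uses made-up loss values, and the helper name and window size are illustrative choices.

```python
def moving_average(values, window):
    """Trailing moving average; early points average over the values seen so far."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]  # drop the value that left the window
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed

noisy_losses = [2.0, 1.0, 3.0, 1.0, 2.0, 0.0]  # made-up values
smoothed_losses = moving_average(noisy_losses, window=3)
print(smoothed_losses)
```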
import matplotlib.pyplot as plt
# Initialize lists to hold training and evaluation losses and steps
train_losses = []
eval_losses = []
train_steps = []
eval_steps = []
# Populate the lists from the log history
for entry in orpo_trainer.state.log_history:
    if 'loss' in entry:
        train_losses.append(entry['loss'])
        train_steps.append(entry['step'])
    if 'eval_loss' in entry:
        eval_losses.append(entry['eval_loss'])
        eval_steps.append(entry['step'])
# Plot the losses
plt.plot(train_steps, train_losses, label='Train Loss')
plt.plot(eval_steps, eval_losses, label='Eval Loss')
plt.xlabel('Steps')
plt.ylabel('Loss')
plt.legend()
plt.show()
Let’s now check the W&B plots. While the loss goes down, we can also see that the difference between the chosen and rejected answers becomes clearer.
Merging Adapters and Saving the Model to Hugging Face Hub
In the subsequent steps, we merge the adapters with the original model using 16-bit precision to enhance quality. Initially, we save it locally in the “model” directory before uploading it to the Hugging Face Hub. The trained model is available at llmat/Mistral-v0.3-7B-ORPO.
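Conceptually, merging folds each adapter's low-rank update into the corresponding base weight, W' = W + s · B · A, where s is the LoRA scaling factor, so the saved model no longer needs separate adapter weights. The toy sketch below shows this on a 2 x 2 matrix in pure Python. It is an illustration of the arithmetic under these assumptions, not Unsloth's merging code, and all names and values are made up.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(weight, lora_a, lora_b, lora_alpha, r, use_rslora=False):
    """Return W + scale * (B @ A) for one linear layer."""
    scale = lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

weight = [[1.0, 0.0], [0.0, 1.0]]  # toy 2 x 2 base weight
lora_a = [[0.5, 0.5]]              # A: r x d_in, with r = 1
lora_b = [[1.0], [0.0]]            # B: d_out x r
merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)
```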
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("llmat/Mistral-v0.3-7B-ORPO", tokenizer, save_method="merged_16bit")